library(tidyverse)
library(plotly)
library(ggplot2)
library(knitr)
library(readr)
library(dplyr)
library(tidyr)
library(scales)
In this chapter, we analyze the relationship between inspection
violations and grades. Our analysis focuses exclusively on data where
the grades are A, B, or C, and excludes records with grades labeled as N
(Not Yet Graded), Z (Grade Pending), or P (Grade Pending issued upon
reopening after a closure). We also exclude “Not Applicable” entries for
critical violation indicators.
Inspection violations are
categorized into two groups: Critical and Not Critical. Critical
violations represent the most significant risks for foodborne illnesses.
To ensure accuracy and relevance, we begin by filtering out unnecessary
data from the dataset before proceeding with the analysis.
# The same data cleaning process in demographics part, also only consider data with valid violation_description and critical_flag, and grade A, B, C.
manhattan_data = read_csv("Manhattan_Restaurant_Inspection_Results.csv", na = c("NA", "", "."))
str (manhattan_data)
## spc_tbl_ [94,616 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ CAMIS : num [1:94616] 50140436 50158081 50152703 50160975 50161087 ...
## $ DBA : chr [1:94616] "JUST SALAD" "THE MANNER" "MIDNIGHT BLUE" "BLUE BLOSSOM" ...
## $ BORO : chr [1:94616] "Manhattan" "Manhattan" "Manhattan" "Manhattan" ...
## $ BUILDING : chr [1:94616] "2853" "58" "106" "108" ...
## $ STREET : chr [1:94616] "BROADWAY" "THOMPSON STREET" "EAST 19 STREET" "WEST 39 STREET" ...
## $ ZIPCODE : num [1:94616] 10025 10012 10003 10018 10025 ...
## $ PHONE : num [1:94616] 7.32e+09 9.17e+09 3.48e+09 6.47e+09 9.29e+09 ...
## $ CUISINE DESCRIPTION : chr [1:94616] NA NA NA NA ...
## $ INSPECTION DATE : chr [1:94616] "01/01/1900" "01/01/1900" "01/01/1900" "01/01/1900" ...
## $ ACTION : chr [1:94616] NA NA NA NA ...
## $ VIOLATION CODE : chr [1:94616] NA NA NA NA ...
## $ VIOLATION DESCRIPTION: chr [1:94616] NA NA NA NA ...
## $ CRITICAL FLAG : chr [1:94616] "Not Applicable" "Not Applicable" "Not Applicable" "Not Applicable" ...
## $ SCORE : num [1:94616] NA NA NA NA NA NA NA NA NA NA ...
## $ GRADE : chr [1:94616] NA NA NA NA ...
## $ GRADE DATE : chr [1:94616] NA NA NA NA ...
## $ RECORD DATE : chr [1:94616] "11/05/2024" "11/05/2024" "11/05/2024" "11/05/2024" ...
## $ INSPECTION TYPE : chr [1:94616] NA NA NA NA ...
## $ Latitude : num [1:94616] 40.8 40.7 40.7 40.8 40.8 ...
## $ Longitude : num [1:94616] -74 -74 -74 -74 -74 ...
## $ Community Board : num [1:94616] 109 102 105 105 107 108 101 105 104 102 ...
## $ Council District : chr [1:94616] "07" "01" "02" "04" ...
## $ Census Tract : chr [1:94616] "019900" "004700" "005000" "011300" ...
## $ BIN : num [1:94616] 1075440 1087362 1017905 1015273 1055676 ...
## $ BBL : num [1:94616] 1.02e+09 1.00e+09 1.01e+09 1.01e+09 1.02e+09 ...
## $ NTA : chr [1:94616] "MN09" "MN24" "MN21" "MN17" ...
## $ Location Point1 : logi [1:94616] NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. CAMIS = col_double(),
## .. DBA = col_character(),
## .. BORO = col_character(),
## .. BUILDING = col_character(),
## .. STREET = col_character(),
## .. ZIPCODE = col_double(),
## .. PHONE = col_double(),
## .. `CUISINE DESCRIPTION` = col_character(),
## .. `INSPECTION DATE` = col_character(),
## .. ACTION = col_character(),
## .. `VIOLATION CODE` = col_character(),
## .. `VIOLATION DESCRIPTION` = col_character(),
## .. `CRITICAL FLAG` = col_character(),
## .. SCORE = col_double(),
## .. GRADE = col_character(),
## .. `GRADE DATE` = col_character(),
## .. `RECORD DATE` = col_character(),
## .. `INSPECTION TYPE` = col_character(),
## .. Latitude = col_double(),
## .. Longitude = col_double(),
## .. `Community Board` = col_double(),
## .. `Council District` = col_character(),
## .. `Census Tract` = col_character(),
## .. BIN = col_double(),
## .. BBL = col_double(),
## .. NTA = col_character(),
## .. `Location Point1` = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
violation_data = manhattan_data %>%
janitor::clean_names() %>%
filter(
!is.na(dba),
!is.na(cuisine_description),
!is.na(grade),
!is.na(score),
!is.na(zipcode),
!is.na(violation_description),
!is.na(critical_flag),
grade %in% c("A", "B", "C")
) %>% mutate(region = case_when(
zipcode >= 10000 & zipcode <= 10025 ~ "Downtown",
zipcode >= 10026 & zipcode <= 10040 ~ "Midtown",
zipcode >= 10041 & zipcode <= 10282 ~ "Uptown",
TRUE ~ "Other" # For ZIP codes like 11371, 12345, etc.
))
First, we analyze the frequency distribution of grades. For both Critical and Not Critical groups, the majority of restaurants received a Grade A, followed by Grade B, and then Grade C. However, only within Grade A do we observe more restaurants with Not Critical violations compared to Critical violations.
# Analyze the effect of violation_description on grade
# Group by critical flag and grade to calculate counts
violation_grade_summary <- violation_data %>%
group_by(critical_flag, grade) %>%
summarize(count = n(), .groups = "drop")
# Calculate average scores by CRITICAL_FLAG and grade
average_score <- violation_data %>%
group_by(critical_flag, grade) %>%
summarize(avg_score = mean(score), .groups = "drop")
# Create an EDA of grade and count using plotly
violation_grade_summary %>%
plot_ly(
x = ~grade,
y = ~count,
type = 'bar',
color = ~critical_flag,
colors = "viridis") %>%
layout(
title = "Effect of Violation Type on Restaurant Grade",
xaxis = list(title = "Grade"),
yaxis = list(title = "Restaurant Number")
)
Additionally, when examining inspection scores across all grade categories, restaurants with Not Critical violations consistently achieve lower scores. This pattern aligns with the corresponding grade results, further highlighting the relationship between inspection outcomes and grade assignments.
# Create an EDA of score and count using plotly
average_score %>%
plot_ly(
x = ~grade,
y = ~avg_score,
type = 'bar',
color = ~critical_flag,
colors = "viridis") %>%
layout(
title = "Effect of Violation Type on Restaurant Score",
xaxis = list(title = "Grade"),
yaxis = list(title = "Restaurant Average Score")
)
Next, we conducted a Chi-Square Test to examine the relationship between the critical flag and grade. The results are visualized using a Contingency Table Heatmap. The analysis shows a p-value smaller than 0.05, leading us to reject the null hypothesis that the critical flag and grade are independent. This indicates a significant relationship between the critical flag and grade at the 0.05 significance level.
# Chi-Square Test
contingency_table <- table(violation_data$critical_flag, violation_data$grade)
chi_test <- chisq.test(contingency_table)
print(chi_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 821.92, df = 2, p-value < 2.2e-16
# Convert the contingency table to a matrix for visualization
contingency_matrix <- as.matrix(contingency_table)
# Extract chi-square test statistic and p-value
chi_stat <- round(chi_test$statistic, 2)
chi_pval <- format.pval(chi_test$p.value, digits = 3)
# Add the chi-square test result as annotations
plot_ly(
x = colnames(contingency_matrix),
y = rownames(contingency_matrix),
z = contingency_matrix,
type = "heatmap",
colors = colorRamp(c("white", "purple"))
) %>%
layout(
title = "Contingency Table Heatmap with Chi-Square Result",
xaxis = list(title = "Grade"),
yaxis = list(title = "Critical Flag"),
colorbar = list(title = "Count"),
annotations = list(
list(
x = 2.1,
y = 1.1,
text = paste("Chi-Square Statistic:", chi_stat, "<br>P-Value:", chi_pval),
showarrow = FALSE,
font = list(size = 14)
)
)
)
## Warning: 'layout' objects don't have these attributes: 'colorbar'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
Then, we analyze the relationship between restaurant cuisine type and
violation. To identify patterns, we calculate the percentage of critical
violations for each cuisine type and rank them accordingly. Our analysis
highlights the top 10 cuisine types with the highest percentage of
violations, with Chinese/Japanese cuisine emerging as the category with
the highest violation rate.
This finding provides valuable insights
into how specific cuisine types are associated with inspection outcomes,
offering a deeper understanding of potential factors influencing
violation frequencies. Such insights could guide further investigation
into operational practices or compliance challenges unique to these
cuisine categories.
# Group by cuisine_description and calculate total and critical counts
violation_cuisine <- violation_data %>%
mutate(critical_flag = ifelse(critical_flag == "Critical", 1, 0)) %>%
group_by(cuisine_description) %>%
summarize(total_violations = n(),
critical_violations = sum(critical_flag, na.rm = TRUE),
critical_percent = (critical_violations / total_violations),
critical_percentage = percent(critical_percent, accuracy = 0.01)
) %>%
arrange(desc(critical_percentage)) %>% # Sort by percentage of critical violations
head(10) # View top 10 restaurant types with the highest percentage of critical violations
plot_ly(
type = 'table',
header = list(
values = c("Cuisine Description", "Critical Percentage"),
align = c("center", "center"),
font = list(size = 14, color = "white"),
fill = list(color = "purple")
),
cells = list(
values = rbind(violation_cuisine$cuisine_description, violation_cuisine$critical_percentage),
align = c("center", "center"),
font = list(size = 12),
fill = list(color = c("white", "white"))
)
)
Among all restaurants with critical violations, we aimed to identify the most common types of violations. To achieve this, we calculated the percentage of each violation type and grouped all categories with less than 3% into an “Other” category for clarity. The results are summarized in a pie chart, providing a clear visualization of the distribution of critical violations.
violation_critical <- violation_data %>%
filter(critical_flag == "Critical") %>% # 19,953 rows in total
group_by(violation_description) %>%
summarize(num_violation = n(),
violation_percent = num_violation / 19953,
violation_percentage = percent(violation_percent, accuracy = 0.01)) %>%
mutate(
violation_description = if_else(violation_percent < 0.03, "Other", violation_description) # Reassign small categories to "Other"
) %>%
arrange(desc(violation_percent))
critical_total <- violation_critical %>%
group_by(violation_description) %>%
summarize(
violation_percent = sum(violation_percent),
violation_percentage = percent(violation_percent, accuracy = 0.01)) #combine percentage of "other"
critical_name = critical_total %>%
mutate(
violation_name = case_when(
violation_description == "Cold TCS food item held above 41 °F; smoked or processed fish held above 38 °F; intact raw eggs held above 45 °F; or reduced oxygen packaged (ROP) TCS foods held above required temperatures except during active necessary preparation." ~ "Temperature of Storage",
violation_description == "Evidence of mice or live mice in establishment's food or non-food areas." ~ "Mice Problem",
violation_description == "Filth flies or food/refuse/sewage associated with (FRSA) flies or other nuisance pests in establishment’s food and/or non-food areas. FRSA flies include house flies, blow flies, bottle flies, flesh flies, drain flies, Phorid flies and fruit flies." ~ "Flies Problem",
violation_description == "Food Protection Certificate (FPC) not held by manager or supervisor of food operations." ~ "FPC",
violation_description == "Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred." ~ "Washing Problem",
violation_description == "Food, supplies, or equipment not protected from potential source of contamination during storage, preparation, transportation, display, service or from customer’s refillable, reusable container. Condiments not in single-service containers or dispensed directly by the vendor." ~ "Contamination Problem",
violation_description == "Hot TCS food item not held at or above 140 °F." ~ "TCS",
violation_description == "Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan." ~ "HACCP",
violation_description == "Sanitized equipment or utensil, including in-use food dispensing utensil, improperly used or stored." ~ "Sanitization Problem",
TRUE ~ "Other" # Default category for any other description
)
)
# Create a pie chart
plot_ly(
data = critical_name,
labels = ~violation_name,
values = ~violation_percent,
type = 'pie', # Pie chart type
textinfo = 'label', # Display labels and percentages
hoverinfo = 'label+percent' , # Show additional info on hover
colors = "viridis"
) %>%
layout(title = "Violation Description Percentages",
legend = list( x = 1.2, y = 0.5 ))
Description of violation names in the above pie chart:
**
Temperature of Storage: Cold TCS food item held above 41 °F; smoked or
processed fish held above 38 °F; intact raw eggs held above 45 °F; or
reduced oxygen packaged (ROP) TCS foods held above required temperatures
except during active necessary preparation.
** Mice Problem:
Evidence of mice or live mice in establishment’s food or non-food areas.
** Flies Problem: Filth flies or food/refuse/sewage associated with
(FRSA) flies or other nuisance pests in establishment’s food and/or
non-food areas. FRSA flies include house flies, blow flies, bottle
flies, flesh flies, drain flies, Phorid flies and fruit flies.
**
FPC: Food Protection Certificate (FPC) not held by manager or supervisor
of food operations.
** Washing Problem: Food contact surface not
properly washed, rinsed and sanitized after each use and following any
activity when contamination may have occurred.
** Contamination
Problem: Food, supplies, or equipment not protected from potential
source of contamination during storage, preparation, transportation,
display, service or from customer’s refillable, reusable container.
Condiments not in single-service containers or dispensed directly by the
vendor.
** TCS: Hot TCS food item not held at or above 140 °F.
** HACCP: Raw, cooked or prepared food is adulterated, contaminated,
cross-contaminated, or not discarded in accordance with HACCP plan.
** Sanitization Problem: Sanitized equipment or utensil, including
in-use food dispensing utensil, improperly used or stored.
**
Other: Any violation description not listed in the above categories.
The analysis reveals that the most frequent critical violation
is related to washing and temperature problem. These are more than a
quarter in total percentage. This finding underscores the importance of
proper temperature management and remain cleaning in ensuring food
safety and reducing the risk of foodborne illnesses. Also, restaurants
owner who want to avoid bad grade in inspection need to perticullar pay
attention to them.
::